THE UNIVERSITY OF ALBERTA 


ABBREVIATION OF ENGLISH WORDS TO 


STANDARD LENGTH FOR COMPUTER PROCESSING 


by 


(C) Richard L. Treleaven 


A THESIS 
SUBMITTED TO THE FACULTY OF GRADUATE STUDIES 
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE 


OF MASTER OF SCIENCE 


DEPARTMENT OF COMPUTING SCIENCE 
EDMONTON, ALBERTA 


FALL, 1970 














UNIVERSITY OF ALBERTA 


FACULTY OF GRADUATE STUDIES 


The undersigned certify that they have read, and
recommend to the Faculty of Graduate Studies for acceptance,
a thesis entitled ABBREVIATION OF ENGLISH WORDS TO STANDARD
LENGTH FOR COMPUTER PROCESSING submitted by Richard L.
Treleaven in partial fulfilment of the requirements for
the degree of Master of Science.





ABSTRACT 


This thesis investigates abbreviation of English
words to a standard length for computer processing. The
criteria for abbreviation techniques are reviewed and
expanded to include a requirement to divide the data base
into logical, uniform-sized segments. Five methods of
sectionalizing the data and three abbreviation techniques
are tested on three vocabularies, ranging in size from
6,354 to 63,316 words. The abbreviation techniques are
tested for the effect of inclusion of length and check
digit and the ordering of letters selected for inclusion in
an abbreviation code. The results of these tests are
presented and analysed. The conclusion is presented that
abbreviation and sectionalizing should be performed by computer
systems using English words as a data base. Alternative
algorithms are presented for using words as unique entities
or for their information content. The results indicate
a high degree of discrimination for the abbreviations;
however, possible improvements are presented which could
increase the discrimination and consequently lead to a
reduced standard length.


CHAPTER I


THESIS INTRODUCTION 


1.1 Introduction

Current computing equipment has increased user awareness
of the computer's potential for handling a large data
base on a random retrieval basis. Considerable interest has
been generated because of increased storage and central
processing speed, particularly direct access storage. The
[illegible] and its potential for individual users has led
to increased interest. Workers in the field of information
retrieval are concerned with these developments for three
reasons:

1. There is a need to permit an individual to control and
   use to advantage the vast amount of information
   published.

2. The data base, consisting of English words, is
   complex in that the words are of variable size. Also
   it is often necessary to access it directly.

3. The data base should be "uncontrolled" as it must
   be expandable to allow for new words.

Methods of processing a data base are simplified if the
variable size words can be replaced by codes of standard
length. This also simplifies the handling requirements needed
for expansion of the data base and it reduces the storage
requirements. The desire to investigate the feasibility of an
algorithm for the practical abbreviation or coding of English
words in a large vocabulary, and an untested abbreviation
technique, led to the motivation for this thesis. While
discovery of such an algorithm might appear to be a minor
contribution to the field of information retrieval, it is
hoped that the resulting savings in storage and processing
might contribute to more significant developments.


1.2 The Problem

Information retrieval systems are being developed and
improved. Their success is generally measured by the degree
of precision and recall [21,22,23] and also the ease of
operation for the user. The degree of precision is the
measure of the significance of the information retrieved while
recall is a comparison of the amount of relevant material
retrieved relative to the amount of relevant information in
the data base. Operation can be difficult if the user is not
permitted to use familiar terms, hence the user's reaction
to coding must be considered.

The reaction of many users to coding techniques is well
exemplified in the report by the sub-committee on Coding
Geographical Names and Terms of the Ad Hoc Committee on
Storage and Retrieval of Geological Data in Canada [9].

    "During the early days of computer technology,
    coding was an effective means for coping with
    problems of limited storage capacity, slow
    processing speed, and the constraints imposed
    by an 80-column card format. Although such
    coding was generally justified from the computer
    point of view, it tended to discourage a user
    from both entering and retrieving information.
    Many early unit-record and computer files lay
    essentially unused because of heavy demands
    placed on contributors to code their data and
    the cryptic nature of the output.

    For this reason, many potential users of
    computerized storage and retrieval systems are
    concerned about coding. Fortunately however,
    coding is no longer a major hindrance. Recent
    developments in computer technology, including
    greatly increased storage capacity and storage
    speed, have effectively eliminated the need,
    so far as the computer is concerned, for coding
    at the input and output stages.... Any computer
    time or storage saved by entering data using
    unfamiliar codes is more than offset by factors
    affecting the cost and accuracy of recording the
    data in the first place. Moreover, the prime
    objective of any system should be to maximize
    effectiveness for the user and the use of
    familiar uncoded language does this best in
    most cases.... Other coding may possibly be
    carried out at a later stage by the computer for
    more efficient processing."*


Consequently, if the processing of the data can be done
more efficiently by nonstandard internal coding, then such
coding as is necessary should be within the computer and
should not affect the user. It is of course possible to
develop an information retrieval system without use of coding
for abbreviation. However as the file grows and processing
time slows down because of the increased volume, it is
necessary to give careful consideration to ways of reducing
the space required to store the data. The natural evolvement
of a large data base creates a need for more core,
more on-line storage and faster processing.


* Brisbin, W.C., and Ediger, N.M. (Editors), A National
System for Storage and Retrieval of Geological Data in
Canada, A report by the Ad Hoc Committee on Storage and
Retrieval of Geological Data in Canada, 1967, p 43.

The successive demands for more memory, followed by
more memory and an enlarging data base which again
requires more memory, is a cyclic movement which could be
partially alleviated by nonstandard internal coding.


Many studies are initiated (as the geological one
will undoubtedly be) with some form of internal coding
schemes. However most of the major information retrieval
projects researched for this thesis, PATRICIA, LEADER, and
CONVERSE ([20], [10], [12]), with the exception of INTREX
and SMART ([15], [21, 22, 23]), make no mention of the employed
coding schemes or techniques. These are test information
retrieval systems based on queries against some combination
of title, abstract, or full text searching. They are
predominantly batch systems with fairly small, stable
vocabularies, and employ a table look-up (sequential, binary
search or scatter storage) or a word stemming technique
(SMART and INTREX only).


It is not the intention of this thesis to review
such systems. However some general observations of the
operations of them will help to establish the key problem
areas:

1. All systems function with the single word as a
   base although some do combine words and/or
   establish links between words.

2. Most systems are currently tape/batch oriented.

3. Interactive communication with the user provides
   highest recall and precision.

The most comprehensive of the above studies is the 
SMART system ([21],[22],[23]) which evaluates various 
search and retrieval techniques. These include automatic 
analysis methods, automatic dictionary construction, and 
iterative search techniques. Principal conclusions 
reached by the SMART study include: 

1. Phrase languages are not substantially superior to
   single terms as indexing devices.

2. Synonym dictionaries improve performance.

3. Other dictionary types, such as hierarchies, are
   not as effective as expected.

4. It becomes increasingly important to evaluate the
   search procedures likely to be used in an automatic
   system, particularly real-time search methods
   where the user provides feedback information to
   improve search strategies.

The SMART study seems to indicate a trend
developing in the field of information retrieval toward
larger, word-oriented, on-line retrieval systems. Based
on his work with SMART, Salton [5] contends there are two
problems facing large scale, on-line information retrieval
systems. These are the small amount of internal storage
which can normally be allocated to any given user, and
the rudimentary nature of the input/output console equipment
likely to be available to each user. This permits
the introduction or withdrawal of only limited amounts of
information.

A possible solution to the console problem lies with
the evolution of the cathode ray tube console and the
development of interactive systems which permit the user an
inquiry into its possible outputs. The storage problem
is far more serious, for the economical availability of
such systems will be beyond the reach of many of
those desiring an on-line information retrieval system.

Such a system could be simplified if the words could
be compacted and standardized. Hence the need for an
algorithm which can generate, within a computer, an abbreviation
code for words of a large vocabulary, do it in a small
amount of computer time, and maintain the uniqueness of the
codes so there is a one-to-one correspondence between the
word and the generated code.


1.3 Objectives

Bourne and Ford [6,7] have done extensive studies
on various coding techniques and have applied them to
vocabularies of 2082 subject words and 8184 full names.
Both references cite the same study in which thirteen
coding techniques are employed to develop abbreviations
of varying lengths. The idealized objectives of their
study are as follows:

1. Each word should be coded to require as little
   storage space as possible.

2. Each word should retain the same degree of
   discrimination and uniqueness that it had in the
   original sample.

3. If possible each word should retain some mnemonic
   similarity to the original word. (For this reason,
   the initial letter was automatically retained with
   all schemes Bourne and Ford tested with the exception
   of the technique which truncated the left end
   of the word).

4. The procedure should not rely on any prior knowledge
   of the words which must be abbreviated. This
   does not exclude tables which provide letter
   frequency statistics but does exclude table look-up
   schemes.

5. It would be a useful feature if the abbreviation
   could be systematically transformed back to the
   original word when desired.

6. It would be a useful feature if the abbreviated
   words could be used as a basis for sorting the words
   into nearly correct alphabetical order.

This thesis is concerned with the internal computer
code generation and use. The following objectives are
modified from the above:

1. The length of each code generated (by all techniques
   and variations) shall be a standard size.

2. Whereas a high degree of discrimination and
   uniqueness is mandatory, it is recognized that the
   computer is capable of handling a small percentage
   of duplication.

3. Since the code generated is to be employed as an
   internal representation, this thesis is not
   concerned with mnemonic sound.

4. Procedures will not rely on prior knowledge of the
   vocabulary beyond the statistical nature of the
   letters.

5. Procedures should attempt to provide for systematic
   transformation of the abbreviated code back to the
   original word.

6. Since most of the coding techniques produce a code
   which is not alphabetically similar to the original
   word, using the code as a basis for sorting the
   words in nearly correct alphabetical order is not
   possible and hence not a concern of this thesis.

7. The generated code should have a characteristic
   which facilitates the sectionalizing of the
   vocabulary to uniform sized segments.

In summary it is the intention of this thesis to:

1. Investigate the feasibility of a standard size code
   for a large vocabulary in the English language.

2. Try such coding techniques against a variety of
   vocabularies.

3. Examine possible means of dividing the data into
   standard sized segments.

4. Report the results of the studies.

5. Draw conclusions on the feasibility of coding and
   sectionalizing and suggest areas for further
   research.
CHAPTER II


DEVELOPMENT OF ABBREVIATION AND TESTING METHODS


2.1 Basic Considerations

Initial tests were done with a technique which
considered the dual position of a letter within a word.
The position was considered relative to the start of a
word (or left end) but also relative to the end of the
word (or right end). These tests were conducted on a
subset of the Chemical Titles vocabulary (described in
2.6). From these tests and the literature review some
basic considerations for development of this project
were established.


2.1.1 Word Stemming

Word stemming, or the elimination of set prefixes
and suffixes [21], can effectively limit the number of words
in a vocabulary. Obviously though, this does not produce
a standard size, for the shortened words are still of
varying lengths. As the prime intention of this thesis
is to produce a standard sized code for words, the
decision to take no action on stemming was made.


2.1.2 Duplication

Initial testing showed that it would be impractical
to attempt to totally eliminate duplication. Duplication
is defined as an identical code for two words or phrases
that had a difference in their spelling. Considering the
difficulty of total elimination, the capabilities of a
computer to handle some duplication, and the nature of the
applications of this data it was logical to minimize
duplication rather than try to eliminate it altogether.
Most applications using English words as a data base employ
a table look-up. They usually seek to improve the speed
of this look-up with an initial search over a small, often
used, subset of the vocabulary. When using a coded version
of English words, this small look-up can be expanded to
include duplicates. Hence the penalty for using codes,
duplication, can be minimized.
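
The definition of duplication given above can be sketched in a few lines of Python. This is only an illustration: the counting convention (every word that shares its code with a differently spelled word is counted) is an assumption, and `first_four` is a hypothetical stand-in encoder, not one of the techniques tested in this thesis.

```python
from collections import defaultdict

def duplication_report(words, encode):
    """Group distinct words by generated code and count colliding words."""
    groups = defaultdict(set)
    for word in words:
        groups[encode(word)].add(word)
    # Assumed convention: a word is a duplicate when its code is shared
    # with at least one differently spelled word.
    duplicates = sum(len(g) for g in groups.values() if len(g) > 1)
    total = sum(len(g) for g in groups.values())
    percent = 100.0 * duplicates / total if total else 0.0
    return duplicates, percent

def first_four(word):
    # Hypothetical encoder used only for this demonstration.
    return word[:4]

sample = ["CHEMICAL", "CHEMIST", "ACID", "ACIDIC", "BASE"]
count, percent = duplication_report(sample, first_four)
print(count, percent)
```

Any of the abbreviation techniques discussed later could be passed in place of `first_four` to compare duplication rates on the same vocabulary.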


2.1.3 Decoding

There are no practical algorithms which can
reconstruct the original word from an abbreviated code when
the code is based on letter selection or deletion. This is
because of the missing letters. In establishing a standard size
of the code this situation can be alleviated if the standard
is sufficiently large. The standard determines the maximum
length of the words which need not be abbreviated. The
problem of distinguishing between abbreviated and
non-abbreviated words must be handled by the code itself. The
code must contain an indicator to differentiate between the
possible representations.


2.1.4 Modifications to Techniques

As outlined in 2.2 there are many abbreviating
techniques and variations of them available. It is not
the intention of this thesis to repeat tests on all such
techniques but rather to test the more successful
techniques on varying vocabularies. As a side measurement an
attempt will be made to gauge the effect of two basic
modifications. These modifications are the generation of
a check digit and the inclusion of the length of the
original word. The check digit is used as a means of
retaining some information of the letters eliminated. The
length modification is of distinct value as a means of
maintaining discrimination when different suffixes and/or
prefixes are removed and the same stem results. To gauge
the effect of these modifications, each technique is used
four times - technique alone, technique with length code,
technique with check digit, and technique with length code
and check digit.


2.1.5 Including the Letter Position in the Code

If the letter frequency is significant in a
position then, in selecting that letter for the abbreviation,
possibly the position of the letter should also be
included. Initial test codes were generated by selecting
the four characters of least frequency in their position
- a summation of frequency of appearance in that position
relative to both ends. A 9 bit code for these letters,
based on letter and position from the left end of the word,
was combined with a code for the length of the original
word. For example if the letters H C L M were selected
from the word CHEMICAL the 40 bit generated code would
be:

    LETTER    POSITION    SUBCODE                SIZE
    length                8                      4 bits
    H (8)     2           (2-1)x26+8  =  34      9 bits
    C (3)     6           (6-1)x26+3  = 133      9 bits
    L (12)    8           (8-1)x26+12 = 194      9 bits
    M (13)    4           (4-1)x26+13 =  91      9 bits


The results were encouraging as there were 57 duplicates
in codes generated for the 4600 word vocabulary, or 1.2%
duplication.

The twenty six alphabet characters can be coded in
5 bits, a saving of 4 bits over allowing 19 positions to
hold twenty six characters, which would require 9 bits to
hold the number less than or equal to 19x26 = 494. The
coding using 5 bits was tested on the same word sample by
selecting the seven least frequent characters. The 5 bit
representation for each letter was combined with the 4 bit
length code. The result was 6 duplicates or .13%
duplication. From this point on the position of a letter
within a word was not employed in the generated code.
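
The position-dependent subcode of the CHEMICAL example can be sketched as follows. The formula (position-1)x26 + letter value is taken from the example; the packing order (length field first, then the four 9 bit subcodes) and the function names are assumptions made only for illustration.

```python
def letter_value(ch):
    """A=1, B=2, ..., Z=26."""
    return ord(ch.upper()) - ord("A") + 1

def position_subcode(ch, position):
    """9 bit subcode from a letter and its 1-based position from the left."""
    return (position - 1) * 26 + letter_value(ch)

def pack_code(word, selected):
    """Pack a 4 bit length and four 9 bit subcodes into one 40 bit integer.

    `selected` is a list of (letter, position) pairs. The field order is
    an assumption; the text only states which fields are combined.
    """
    if len(word) > 15 or len(selected) != 4:
        raise ValueError("sketch handles 4 letters and lengths up to 15")
    code = len(word)
    for ch, pos in selected:
        sub = position_subcode(ch, pos)
        if sub >= 512:
            raise ValueError("subcode does not fit in 9 bits")
        code = (code << 9) | sub
    return code

chosen = [("H", 2), ("C", 6), ("L", 8), ("M", 4)]
subcodes = [position_subcode(ch, pos) for ch, pos in chosen]
packed = pack_code("CHEMICAL", chosen)
print(subcodes, packed.bit_length())
```

Dropping the position and storing seven 5 bit letter values plus the 4 bit length needs 39 bits, which is why the second test above fits the same five bytes.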
2.1.6 Size of Standard Code

The standard size was initially established at five
bytes (40 bits) and was maintained through preliminary
testing. The five byte code produced no duplication, when
tested with check digit, on the 4600 word subset of the
Chemical Titles vocabulary. An attempt was made to reduce
the size to four bytes (32 bits) but the results were very
discouraging. Where there were duplicates before, these
increased threefold and duplicates appeared where there
were previously none. As the intention was to use
significantly larger vocabularies the standard size was left at
five bytes.


2.1.7 Sectionalizing the Data Base

The nature of the use of this data is such that it
will be used in a comparative manner which implies a table
search requirement. If the table is stored on a direct
access device, the generated code can be used as a key,
assuming the file can be organized on an index sequential
basis such that records need not be scattered across a
large space (allowing for 2^40 - 1 records). If there is
room in core or there is no index sequential access method
available, logically it would be easier to handle the data
if it could be manipulated in standard sized subsections.
Hence the study was expanded to explore the possibility
of finding an algorithm to divide the data into logical
segments.
This would be similar to generating a hash code
(see 2.5) for scatter storage retrieval techniques
(Morris, R., [19]). However the key generated for each
of these codes would be identical to other keys. Since
the intention of the hash code is to break the data into
logical units of nearly uniform size to permit easier
comparative searching, such duplications are necessary. For
example, if the vocabulary was 25,000 words long it would
take a 15 pass binary search to locate a word. If the
vocabulary was evenly distributed into 256 units by a
hash code, the binary search might be on no more than 125
words (125 x 256 > 25,000). Such a search would be
completed in 7 passes or half the time. For scatter storage
techniques that seek to minimize the need for
duplication resolution, the code can serve as the key.
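
The arithmetic of the 25,000 word example can be checked with a short sketch. The segment function used here (`code % segments`) is purely a placeholder assumption and is not one of the sectionalizing methods tested in this thesis; it stands in for any hash that spreads codes evenly.

```python
import bisect
from collections import defaultdict

def binary_search_passes(n):
    """Worst-case number of probes to locate one item among n sorted items."""
    return max(1, n.bit_length())  # equals floor(log2(n)) + 1 for n >= 1

def build_segments(codes, segments=256):
    """Distribute codes into segments, each kept sorted for binary search."""
    table = defaultdict(list)
    for code in codes:
        table[code % segments].append(code)  # placeholder hash (assumption)
    for bucket in table.values():
        bucket.sort()
    return table

def lookup(table, code, segments=256):
    """Binary search within the single segment selected by the hash."""
    bucket = table.get(code % segments, [])
    i = bisect.bisect_left(bucket, code)
    return i < len(bucket) and bucket[i] == code

table = build_segments(range(25000))
largest = max(len(bucket) for bucket in table.values())
print(binary_search_passes(25000), largest, binary_search_passes(largest))
```

With an even distribution no segment exceeds 125 entries, so the search inside a segment needs at most 7 probes against 15 for the undivided table.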


2.1.8 Coding of Numerics

During initial testing a problem was encountered
for coding numerics. To maintain a five bit representation
of characters, some overlap of the coding of numerics
would be necessary. Two overlap techniques were tried.

1. Numerics would overlap within themselves by
   coding the numbers in the following manner:

       number       0  1  2  3  4  5  6  7  8  9
       5 bit code  27 28 29 30 31 27 28 29 30 31

2. Numerics would overlap with the letters V W X
   Y Z in the following manner:

       letter       V  W  X  Y  Z
       number       0  1  2  3  4  5  6  7  8  9
       5 bit code  22 23 24 25 26 27 28 29 30 31

The Chemical Titles vocabulary has a large number of
words which are alphanumeric (mixed letters and numbers)
and neither of these overlap techniques succeeded. In
both cases duplications, such as XX and X2, or X1 and X6,
were caused by the overlap. Examination of the vocabulary
indicated that most words which contained numerics were
short (six characters or less). This led to the use of a
six bit character representation to code words of six or
less characters. For words greater than six characters
the letter overlap method for representing numerics was
used.
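
The duplications named above can be reproduced with a small sketch of the two overlap schemes. The digit codes used here (digits folded onto 27 to 31, and digits 0 to 4 sharing the codes of V to Z) are an assumption chosen to be consistent with the XX/X2 and X1/X6 collisions cited in the text; the function names are illustrative only.

```python
def letter_code(ch):
    """A=1, B=2, ..., Z=26."""
    return ord(ch) - ord("A") + 1

def encode_self_overlap(ch):
    """Scheme 1 (assumed): ten digits folded onto the five codes 27..31."""
    if ch.isdigit():
        return 27 + int(ch) % 5
    return letter_code(ch)

def encode_letter_overlap(ch):
    """Scheme 2 (assumed): digits 0..4 share codes with V..Z, 5..9 use 27..31."""
    if ch.isdigit():
        return 22 + int(ch)
    return letter_code(ch)

def encode_word(word, encode_char):
    """Encode each character of an upper-cased alphanumeric word."""
    return tuple(encode_char(c) for c in word.upper())

print(encode_word("X1", encode_self_overlap), encode_word("X6", encode_self_overlap))
print(encode_word("XX", encode_letter_overlap), encode_word("X2", encode_letter_overlap))
```

Each scheme removes the collision produced by the other, which is why neither one alone was adequate for short alphanumeric words.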


2.2 Abbreviation Techniques

There are many methods for abbreviating words. The
selection of method is usually determined by the intended
purpose of the abbreviation. If the abbreviation is to be
performed manually, the criterion of mnemonic similarity
is often important unless the vocabulary is small. If this
is the case an assigned code for each word or phrase can
be memorized. The best examples of an assigned code are
those used by telegraph companies for standard messages
such as STOP within a message.

As the vocabulary enlarges there is a tendency to
establish a set of rules or guidelines to be followed in
abbreviating words. There is no standard set of rules but
there is a lot of common ground. This is primarily in
selecting characters for deletion based on a letter
frequency scale of some sort. Two examples of such codings
are the Soundex code [4,6] and a method suggested by
Barrett and Grems [3]. The principles of these two codes
are as follows:

Soundex:

1. Retain the first letter of each name
   counting from the left.

2. Drop out A E I O U Y W H.

3. Assign the following numbers to remaining
   similar-sounding sets of letters.

       1   B F P V
       2   C G J K Q S X Z
       3   D T
       4   L
       5   M N
       6   R

   (e.g. DARLINGTON = D645)

4. Special cases
   a) If there are insufficient letters, fill
      out with zeros.
      (e.g. MORAN = M650)
   b) Drop out second letter in a letter pair.
      (e.g. KELLEY = K400)
   c) Drop out adjacent equivalent letters.
      (e.g. JACKSON = J250)
   d) Drop out adjacent equivalents of first
      letter.
      (e.g. LLOYD = L300)
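
The Soundex rules listed above can be sketched as follows. The digit groups are the commonly used Soundex assignment, and the treatment of H and W as transparent between equivalent letters follows common practice rather than anything stated in this thesis; both are assumptions of this sketch.

```python
# Commonly used Soundex letter groups (assumed, see lead-in).
SOUNDEX_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                  "4": "L", "5": "MN", "6": "R"}
SOUNDEX_DIGITS = {ch: digit
                  for digit, letters in SOUNDEX_GROUPS.items()
                  for ch in letters}

def soundex(name, length=4):
    """Return the first letter followed by digits, zero-filled to `length`."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch, "")
        # Skip dropped letters and adjacent equivalents (rules 2, 4b-4d).
        if digit and digit != previous:
            code += digit
        if ch not in "HW":
            previous = digit
    return code.ljust(length, "0")[:length]

for name in ("DARLINGTON", "MORAN", "KELLEY", "JACKSON", "LLOYD"):
    print(name, soundex(name))
```

The five names are the examples given in the rules, so the output can be compared directly against the codes listed there.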


Barrett and Grems:

1. Decide how many letters are needed in the
   abbreviation and stop eliminating letters
   when that number is reached.

2. A phrase can contain prepositions (on, for,
   by, of, etc.) and articles (a, an, the).
   Treat these words as significant phrase words,
   by attaching them to the preceding word by
   dropping embedded spaces, or just drop them.

3. Always save the first letter of each single
   word or each phrase. More often, also save
   the first letter of each word in a phrase.

4. Always eliminate letters in a word or phrase
   from the right hand side through the left
   hand side.

5. Eliminate the letters of a word according to
   the relative position of the letter in the
   following frequency scale. The highest ranking
   letters (those on the left) are eliminated
   first.

       [frequency scale illegible]

   Hence, working from the right hand side of the
   word, first all the E's would be eliminated,
   then the next ranked letter, and so on.

6. Use the abbreviation for n letters to derive
   the abbreviation for n-1 letters.

7. For the initial abbreviation, save the first
   three letters of each word, and reduce the
   remainder of the words according to the
   frequency scale.

8. For all succeeding abbreviations save the first
   consonant (with preceding vowel) or the first
   letter of the word(s) and reduce the remainder
   of the word(s) according to the frequency
   scale.

Of significance, in the Barrett and Grems method,
is the lack of an objective to maintain uniqueness for the
codes generated. As indicated in Bourne and Ford's study
[6,7], for small numbers of characters remaining, the
elimination of letters based on letter frequency does not
rank as a superior method.

The study by Bourne and Ford [6,7] cites thirteen
techniques for abbreviating words of the English language:

1. Selective dropout (n=2) and add a check digit.
   Every second letter (n=2) is deleted and
   accumulated to create a check digit.

2. Selective dropout by separate rankings of
   character usage for each letter position.
   Each letter is given an index based on
   frequency of appearance in that position.
   The letters are removed in order of their
   indices.

3. Selective dropout by single ranking of bigram
   usage.
   From a frequency of occurrence of bigram pairs,
   each letter is given an index based on its use
   as a bigram with its neighbors. The letters
   are removed in order of their indices.

4. Selective dropout (n=2).
   Drop every second character until the desired
   number of characters remain.

5. Selective dropout (n=3).
   Drop every third character until the desired
   number of characters remain.

6. Selective dropout by single ranking of letter
   usage.
   The letters are deleted according to a ranking
   of occurrence of letters.

7. Selective dropout (n=4).
   Drop every fourth character until the desired
   number of characters remain.

8. Truncate right end and add a check digit.
   Characters are deleted from the right end and
   accumulated to create a check digit.

9. Vowel elimination.
   Eliminate the letters A,E,I,O,U. If too many
   characters remain, truncate the right end.

10. [technique name illegible]
    Fold the letters together in a fixed manner
    and truncate right end.

11. Truncate right end.
    Delete characters from right end until the
    desired number of characters remain.

12. Truncate left end.
    Delete characters from left end until the
    desired number of characters remain.

13. First letters and last consonant of edited string
    of characters (used on names only).
    For this technique no explanation or results
    are given.

These techniques were tested and reported [6,7] for
varying lengths of the abbreviated code. Only the selective
dropout of every second character and add a check digit
proved consistently superior.
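
Three of the simpler techniques in the list above (vowel elimination and the two truncations) can be sketched as follows. This is a minimal illustration rather than Bourne and Ford's implementation; retaining the initial letter during vowel elimination follows the remark in 1.3 that the initial letter was kept by all schemes except left truncation, and the function names are assumptions.

```python
VOWELS = set("AEIOU")

def truncate_right(word, length):
    """Technique 11: keep the leftmost `length` characters."""
    return word[:length]

def truncate_left(word, length):
    """Technique 12: keep the rightmost `length` characters."""
    return word[-length:] if length > 0 else ""

def vowel_elimination(word, length):
    """Technique 9: drop A,E,I,O,U after the initial letter, then truncate."""
    word = word.upper()
    kept = word[:1] + "".join(c for c in word[1:] if c not in VOWELS)
    return truncate_right(kept, length)

print(vowel_elimination("ELECTRONEGATIVE", 6))
print(truncate_right("CHEMICAL", 4), truncate_left("CHEMICAL", 4))
```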


2.3 Techniques Employed in This Study

The basic techniques used for this study were:

1. The selective dropout by dropping every second
   character.

2. Selection by a single ranking of bigram
   usage.

3. Selection by separate ranking of character
   usage for each dual letter position.

These techniques were typical of abbreviation techniques in
that the first is based on an arbitrary exclusion of letters,
the second on letter pair frequencies, and the third on
letter frequencies.

The number of duplications generated by each
of the techniques alone was used for the comparison of the
effect of modifications. The modifications, length and
check digit, were combined with the techniques individually
and both at once. Hence the combination of the techniques
and modifications produced the variations:

1. Technique alone

2. Technique plus length

3. Technique plus check digit

4. Technique plus length and check digit.

Check digit generation was done by accumulating
the dropped letters (A=1,B=2,...,Z=26). The resulting sum
was tested for greater than 26 and repeatedly decremented
by 26 until it was less than or equal to 26, thus producing
a number which paralleled the letter codings.

For example if the letters C K Aeleaveare 


dropped from a word the check digit generation would be: 
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34+ 11 +5+9 + 25 = 53 
53 - 26 = 27 27 - 26=1eE=A the check 
aveit 
Waites the creck-dielt could range from 0 to 31 
rather than 1 to 26, the remaining 6 numbers are left for 
future use. For example ao check digit could represent 
an uncodeéd word while a 31 check digit could represent a 


Spectral uype Of coding. 
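The check digit rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names are hypothetical, and treating every non-letter as 0 is an assumption based on the coding of spaces and special characters used elsewhere in this study.

```python
def letter_value(ch):
    """A=1 ... Z=26; anything else (space, hyphen, apostrophe) counts as 0."""
    if len(ch) == 1 and "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 1
    return 0  # assumption: non-letters contribute nothing to the sum


def check_digit(dropped):
    """Sum the dropped letters, then subtract 26 until the sum is <= 26."""
    total = sum(letter_value(ch) for ch in dropped)
    while total > 26:
        total -= 26
    return total


def check_letter(dropped):
    """The check digit written as a letter ('' when the digit is 0)."""
    digit = check_digit(dropped)
    return chr(ord("A") + digit - 1) if digit else ""


print(check_digit("CKEIY"), check_letter("CKEIY"))  # 1 A
```

The repeated subtraction keeps a sum of exactly 26 as 26 (the letter Z) instead of folding it to 0, which is why a plain remainder is not used here.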


2.3.1 Selective Dropout by Every Second Letter 
Position 
Starting at the left end of a word every second
letter was dropped until the desired length was reached
(for this study 6, 7 and 8 letters). If there were still
too many letters remaining after all the even ones had been
dropped, this was repeated on the shortened word. If there
were 17 letters in the word, the first pass eliminated
letters in positions 2,4,6,8,10,12,14,16, leaving the nine
odd numbered positions. The second pass eliminated from
positions 3,7,11,15, as necessary. The following example
will illustrate the reduction performed for each of the
four variations.
ELECTRONEGATIVE
First pass

ELECTRONEGATIVE (letters in even positions dropped)
= EETOEAIE

8 characters used for technique
alone

Second pass

EETOEAIE (second letter dropped)
= ETOEAIE
7 characters used for technique
plus length and technique plus
check digit

EETOEAIE (second and fourth letters dropped)
= ETEAIE

6 characters used for technique
plus length and check digit
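The passes shown for ELECTRONEGATIVE can be reproduced with a short Python sketch. The function name and the decision to return the dropped letters as well (so that a check digit can be formed from them) are illustrative assumptions, not part of the original programs.

```python
def selective_dropout(word, target):
    """Drop every second letter, left to right, until `target` letters remain.

    Returns (kept, dropped). If one pass is not enough, the pass is
    repeated on the shortened word.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    kept = list(word)
    dropped = []
    while len(kept) > target:
        i = 1  # index of the second letter of the current word
        while i < len(kept) and len(kept) > target:
            dropped.append(kept.pop(i))
            i += 1  # after the pop, i + 1 is the next letter to drop
    return "".join(kept), "".join(dropped)


for size in (8, 7, 6):
    print(size, selective_dropout("ELECTRONEGATIVE", size)[0])
# 8 EETOEAIE
# 7 ETOEAIE
# 6 ETEAIE
```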


2.3.2 Selection by Single Ranking of Bigram Usage 
The words were considered as:
space letters space
From an examination of the words of a vocabulary a matrix was
constructed which gave the frequency of letters as occurring
behind the preceding letter or space. As this study
permitted imbedded spaces and special characters, these
were translated to 0, similar to A=1, B=2, etc. They were
then treated as a space in a bigram pair. The presence of
a numeric was also considered as a space for bigram pairing.
Hence the 27 x 27 square matrix had elements corresponding
to the possible bigrams. Each element of a row indicated
the frequency of a letter (indicated by column) occurring
after the letter indicated by the row. The following
example will illustrate the principle behind the construction
of this matrix.

The word CAT being edited for the construction of
the matrix would add 1 to the following elements:

space  C        0,3
C      A        3,1
A      T        1,20
T      space    20,0

note 0 as a coordinate for space.

The letters of a word were examined and the matrix
was used to determine the frequency of each letter as it
was used as a bigram with both its neighboring letters.
An index was created for each letter based on the sum of
the frequency of its use as bigrams. The letters were
ranked according to this index and selection was done by
employing those of lowest rank until the desired number of
letters was obtained.

For example, to abbreviate the word CONVENTION, sum
the frequencies of each letter for its association with
the two neighboring letters, rank the letters according
to this sum, and select the desired number of characters:

                           frequencies
                           (from MARC              ranked
letters   coordinates      tapes)        index     letters
   C      0,3    3,15      600   333       933       V
   O      3,15   15,14     333   695      1028       I
   N      15,14  14,22     695    29       724       N
   V      14,22  22,5       29   221       250       T
   E      22,5   5,14      221   551       772       E
   N      5,14   14,20     551   371       922       N
   T      14,20  20,9      371   393       764       C
   I      20,9   9,15      393   318       711       N
   O      9,15   15,14     318   695      1013       O
   N      15,14  14,0      695   274       969       O
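The construction of the bigram matrix and of the letter indices can be sketched as follows. This is a minimal illustration in Python, not the original program; the function names are hypothetical and the sketch assumes single upper-case words. The index values used for CONVENTION are those read from the example above, and the value for C (933) is hard to read in the source, so it is an assumption.

```python
def code(ch):
    """A=1 ... Z=26; spaces, special characters and numerics map to 0."""
    if len(ch) == 1 and "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 1
    return 0


def build_bigram_matrix(words):
    """27 x 27 counts: matrix[row][col] = times `col` follows `row`."""
    matrix = [[0] * 27 for _ in range(27)]
    for word in words:
        codes = [0] + [code(ch) for ch in word] + [0]  # space letters space
        for row, col in zip(codes, codes[1:]):
            matrix[row][col] += 1
    return matrix


def bigram_indices(word, matrix):
    """Index of each letter: bigram with left neighbor + bigram with right."""
    codes = [0] + [code(ch) for ch in word] + [0]
    return [matrix[codes[i - 1]][codes[i]] + matrix[codes[i]][codes[i + 1]]
            for i in range(1, len(codes) - 1)]


def select_lowest(word, indices, count):
    """Letters with the lowest indices, in rank order (ties keep word order)."""
    order = sorted(range(len(word)), key=lambda i: indices[i])
    return "".join(word[i] for i in order[:count])


matrix = build_bigram_matrix(["CAT"])
print(matrix[0][3], matrix[3][1], matrix[1][20], matrix[20][0])  # 1 1 1 1

# Index column of the CONVENTION example (C = 933 is an assumed reading).
convention = [933, 1028, 724, 250, 772, 922, 764, 711, 1013, 969]
print(select_lowest("CONVENTION", convention, 6))  # VINTEN
```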


2.3.3 Selection by Ranking of Letter Usage by
Dual Position

From an edit of the words of a vocabulary two
matrices were constructed which gave the frequency of
character usage by position (note maximum of 19 positions
permitted). In the first matrix row 1 corresponded to
position 1 as the left most character of the word; row 2
corresponded to position 2 as the position to the right of
position 1, and so on. The second matrix considered
position 1 as the first non-blank encountered by starting
at the right end of the field of 19 characters (the word
or phrase is left justified in this area) and moving to
the left. Position 2 in the second matrix was considered
to be the position to the left of position 1 and so on. For
example the letters of the word CAT would have the
following coordinates:

     Matrix I   Matrix II
C      1,3        3,3
A      2,1        2,1
T      3,20       1,20


When a word was submitted for abbreviation an index
was created for each character by summing the two frequencies
of that letter in that position. The letters
were ranked by this index and characters were selected
according to this rank. Low rank letters were selected
first. The letters for the abbreviation of the word
CONVENTION would be chosen from the top of the ranked
letters in the following example.

                               frequencies
                               (from MARC
          coordinates          tapes)
          Matrix   Matrix      Matrix   Matrix             ranked
letters   I        II          I        II        index    letters
@ lehseeglOca3 290 76 366 V 
Ak D D4 LD 339 95 534 N 
3,14 8,14 214 10 7, 32 J! O 

Gee 29 43 72 G 


Se woome2 “Teo Dormicit were5 09 
Geli guhzelinegs 207° 193 - 400 
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ho 
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I S09 cores) 204 560 764 
O On] 5 ate 6 sh 86 237 343 


He © fH 2 & 


N vO are aNpate! 110 274 384 
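A sketch of the two position matrices and of the resulting letter index is given below in Python. The names are hypothetical and the sketch assumes upper-case words of at most 19 characters; it illustrates the description above and is not the original program.

```python
FIELD = 19  # maximum number of positions permitted


def code(ch):
    """A=1 ... Z=26; everything else maps to 0."""
    if len(ch) == 1 and "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 1
    return 0


def coordinates(word):
    """(position, letter) pairs for Matrix I (from left) and II (from right)."""
    n = len(word)
    return [((pos, code(ch)), (n - pos + 1, code(ch)))
            for pos, ch in enumerate(word, start=1)]


def build_position_matrices(words):
    """Return (left, right); each is counts[position][letter code]."""
    left = [[0] * 27 for _ in range(FIELD + 1)]   # row 0 is unused
    right = [[0] * 27 for _ in range(FIELD + 1)]
    for word in words:
        word = word[:FIELD].rstrip()  # left justified, trailing blanks ignored
        for (lpos, c), (rpos, _) in coordinates(word):
            left[lpos][c] += 1
            right[rpos][c] += 1
    return left, right


def position_indices(word, left, right):
    """Index of each letter: sum of its two positional frequencies."""
    return [left[lpos][c] + right[rpos][c]
            for (lpos, c), (rpos, _) in coordinates(word)]


print(coordinates("CAT"))
# [((1, 3), (3, 3)), ((2, 1), (2, 1)), ((3, 20), (1, 20))]
```

Letters would then be ranked by these indices, lowest first, in the same way as for the bigram technique.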


2.3.4 Alternative to Selection of Letters by
Indices

In both the bigram and letter/position
techniques, the letters were assigned an index, ranked,
and the letters of the abbreviation were ordered according
to this rank. This was called 'selection of letters'.
Another possibility was a simple deletion of letters with
high indices and the use of the original letter order. This
'dropping out of letters' could take advantage of the
original spelling of words to minimize duplication. Both
of these methods of ordering the selected letters were
tested on all vocabularies.


2.4 Coding Formats 


A standard code format was used for each variation
of the three techniques. Since the codes to be generated
were for internal use, those words that were small enough
to be coded without abbreviation could not be ignored.
They had to be coded and checked against the abbreviations
for larger words. The formats were established with the
following guidelines:

1. When the length was present it was the left most
four bits.

2. When the check digit was present it was the left
most character (five bits) - bits 1 - 5 if length
was not present, bits 5 - 9 if length was present.

3. Words of 6 or less characters were coded using
a 6 bit character representation.

4. If space permitted, words of length 7 or 8 were
coded using 5 bits per character with no
abbreviation. There was not enough space for
length 8 when the length code was used.

5. All unused fields were zero filled.

The following coding formats were used. The number
in brackets behind the variation indicates the number of
characters used in the definition of a short word for that
variation.

2.4.1 Technique Alone (8)

1. number characters ≤ 6
   6 - 6 bit character fields
   1 - 4 bit unused

2. number characters 7 or 8
   8 - 5 bit character fields

3. number characters > 8 (abbreviated)
   8 - 5 bit character fields

2.4.2 Technique With Length (7)

1. number characters ≤ 6
   1 - 4 bit length field
   6 - 6 bit character fields

2. number characters = 7
   1 - 4 bit length field
   7 - 5 bit character fields
   1 - 1 bit unused

3. number characters > 7 (abbreviated)
   1 - 4 bit length field
       (lengths > 15 reduced to 15)
   7 - 5 bit character fields
   1 - 1 bit unused

2.4.3 Technique With Check Digit (8)

1. number characters ≤ 6
   6 - 6 bit character fields
   1 - 4 bit unused

2. number characters 7 or 8
   8 - 5 bit character fields

3. number characters > 8 (abbreviated)
   1 - 5 bit check digit field
   7 - 5 bit character fields

2.4.4 Technique With Length and Check Digit (6)

1. number characters ≤ 6
   1 - 4 bit length field
   6 - 6 bit character fields

2. number characters = 7
   1 - 4 bit length field
   7 - 5 bit character fields
   1 - 1 bit unused

3. number characters > 7 (abbreviated)
   1 - 4 bit length field
   1 - 5 bit check digit field
   6 - 5 bit character fields
   1 - 1 bit unused
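The abbreviated formats above all fit a 40 bit field. The following Python sketch packs such a code into an integer; it is an illustration only. The function name is hypothetical, the letters are assumed to be coded A=1 to Z=26 in five bits, and the fields are assumed to be packed left to right with zero fill on the right. The 6 bit representation used for short words is not shown because its character set is not described here.

```python
def code5(ch):
    """Five bit letter code: A=1 ... Z=26, anything else 0."""
    if len(ch) == 1 and "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 1
    return 0


def pack_abbreviation(letters, length=None, check=None):
    """Pack an abbreviated word into a 40 bit integer.

    length: original word length (4 bit field) or None if not used
    check:  check digit 0..31 (5 bit field) or None if not used
    """
    fields = []
    if length is not None:
        fields.append((min(length, 15), 4))  # lengths > 15 reduced to 15
    if check is not None:
        if not 0 <= check <= 31:
            raise ValueError("check digit must fit in five bits")
        fields.append((check, 5))
    fields.extend((code5(ch), 5) for ch in letters)

    value = used = 0
    for field, width in fields:
        value = (value << width) | field
        used += width
    if used > 40:
        raise ValueError("fields do not fit in 40 bits")
    return value << (40 - used)  # unused low order bits are zero filled


packed = pack_abbreviation("SELNGS", length=9, check=18)
print(format(packed, "040b"))
# 1001100101001100101011000111000111100110
```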
2.5 Sectionalization by Hashing Methods


2.5.1 Hashing Methods


Hash codes are generated by performing some numeric
manipulations on a field of information such as a
name or word. These manipulations produce an identifier or
hash code (usually a number) which is directly associated
with the field. This code is not necessarily unique among
a collection of hash codes generated by the same method
from fields of the same type.

There are many possible means by which a 40 bit field
can be manipulated to generate a key [19]. Hashing on the
abbreviated code was performed to see if a smaller



identifier (key) could be generated which would provide
a uniform distribution for the associated abbreviations.
This was tried with five methods which were based on the
structure of the abbreviated codes. They were used on all
vocabularies, to ensure that the resulting distribution was
not a product of a single data base.

The five hashing methods were:

1. First 8 bits (one byte) were used as a key.

This included length plus four bits of check
digit or check digit plus three bits of the
first letter.

2. First 14 bits were used as a key.

This included length plus check digit plus
first letter or check digit plus first letter
plus four bits of second letter.

3. The five bytes were summed and the result was
used as a key.

This was done without consideration of content.

4. The five bytes were summed and the low order
eight bits of the result were used as a key.

Working with the sum from method 3 above, the
high order bits were stripped off to make a
smaller key. This would be the same as summing
the 5 bytes, dividing by 256, and using the
remainder as a key.

5. The length and the 5 bit character representation
were summed and the result was used as a key.
This ignored the absence of the length code
for abbreviations generated by technique alone,
and technique plus check digit.

For example the abbreviation for SEEDLINGS using
selective dropout of every second letter plus length code
and check digit would be:


Abbreviation : SEEDLINGS (E, D and I struck out)

length   check digit   letters
  9          R         S   E   L   N   G   S
  9          18        19  5   12  14  7   19


Hash codes:
1. First 8 bits

Binary representation of 9 and R

  9      R(18)
 1001    10010
hash code 1001 1001

or 153


2. First 14 bits
Binary representation of 9, R, and S
  9      18    and   19
 1001   10010       10011
hash code 1001 10010 10011

or 9811

3. Sum the five bytes

Binary representation of the abbreviation
  9    18    19     5    12    14     7    19
 1001 10010 10011 00101 01100 01110 00111 10011
note 39 bits - 40th is 0
Byte representation
10011001 01001100 10101100 01110001 11100110
1538 48 lege aS esd 
hash code 01011001010


or 714


4. Sum the five bytes and use the low order
eight bits
The sum from method 3 has the 11 bit
representation
01011001010
Eliminate three high order bits
hash code 11001010
or 202


5. Sum the length and five bit character
representation

               Word    Binary    Decimal
length          9      1001         9
check digit     R      10010       18
                S      10011       19
                E      00101        5
                L      01100       12
                N      01110       14
                G      00111        7
                S      10011       19
                                  ---
                                  103
hash code 01100111


or 103
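The five hashing methods can be sketched in Python as operations on the 40 bit code held as an integer. The function names are hypothetical. The constant below is an assumed packing of the SEEDLINGS example (length 9, check digit R, letters S E L N G S, letters coded A=1 to Z=26, fields packed left to right, 40th bit zero). The byte sums of methods 3 and 4 depend on that packing, so only methods 1, 2 and 5 are compared with the figures of the example.

```python
CODE = 0x994CAC71E6  # assumed 40 bit packing of 9, R, S E L N G S


def first_bits(code, n):
    """Methods 1 and 2: the n high order bits of the 40 bit code."""
    return code >> (40 - n)


def byte_sum(code):
    """Method 3: sum of the five bytes, without regard to content."""
    return sum(code.to_bytes(5, "big"))


def byte_sum_low8(code):
    """Method 4: low order eight bits of the byte sum."""
    return byte_sum(code) % 256


def field_sum(code):
    """Method 5: 4 bit length plus the seven 5 bit groups that follow.

    The first four bits are treated as a length even when the code
    carries no length field; the 40th bit is ignored.
    """
    total = code >> 36
    for k in range(7):
        total += (code >> (31 - 5 * k)) & 0b11111
    return total


print(first_bits(CODE, 8))   # 153
print(first_bits(CODE, 14))  # 9811
print(field_sum(CODE))       # 103
```

Method 4 is the remainder of method 3 on division by 256, so `byte_sum_low8(code) == byte_sum(code) % 256` holds for every code.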


2.5.2 Method for Evaluating the Distribution

For each coded word in each vocabulary a key was
generated by all hashing methods. Because of the limitations
of the size of the key it was possible to construct
tables which counted the occurrence of the different keys.
For example, the summation of five bit groupings would have
the following limitation. The first four bits of length
had to be a number less than or equal to 15 (2^4 - 1).
Each of the seven character representations could not be
a number larger than 31 (2^5 - 1). Hence the maximum hash
value that could be generated was 7x31+15 or 232.

For the summation of 5 bit groupings, a table was
used which had 232 elements and maintained a count of the
number of times each hash value was generated. If the hash
value of a code was 136 then element 136 in the table was
incremented by 1. When the entire vocabulary had been
read, the mean distribution for each table was calculated
by dividing the total number of words by the number of
entries in the table. This figure represented the number
each element of the table would contain if the hashing
method created random hash values. To provide a basis for
comparison of the distribution for each method, the
standard deviation for each table was calculated using the
root mean squared method.
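The counting table, its mean and its standard deviation can be sketched as below. This Python fragment is an illustration under stated assumptions: the function name is hypothetical, the table size is passed in by the caller, and the standard deviation is taken as the root of the mean squared deviation of the counts from the mean.

```python
import math

MAX_FIELD_SUM = 7 * 31 + 15  # largest sum of 5 bit groupings: 232


def distribution_stats(hash_values, table_size):
    """Count each hash value and summarise how evenly the table is filled.

    Returns (table, mean, standard deviation). The mean is the count
    every element would hold if the hash values were spread evenly.
    """
    table = [0] * table_size
    for value in hash_values:
        table[value] += 1
    mean = len(hash_values) / table_size
    variance = sum((count - mean) ** 2 for count in table) / table_size
    return table, mean, math.sqrt(variance)


# Small illustrative input, not data from this study.
table, mean, deviation = distribution_stats([0, 0, 1, 3], table_size=4)
print(table, mean)  # [2, 1, 0, 1] 1.0
```

A smaller standard deviation relative to the mean indicates a more uniform distribution of keys. Note that if a hash value of 0 can occur, values 0 to 232 need a table of `MAX_FIELD_SUM + 1` elements in this sketch.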





2.6 Data Base 
2.6.1 Vocabularies 
The abbreviating techniques were tested on three 
varying vocabularies. The vocabularies ranged in size, 
content and degree of correct spelling. These vocabularies 
were taken from: 

1. Chemical Titles Tapes [24]: These tapes were
produced to provide title, author and identification of
articles recently published in the field of chemistry.
The different words used in article titles were collected
from these tapes for the years 1964, 1965, and 1966. This
was the largest vocabulary, contained the highest degree
of spelling errors and many numerics.

2. MARC Tapes: These tapes are circulated by
the Library of Congress and contain author, title, short
abstract and identification of books recently catalogued.
Four of the MARC tapes available at the University of
Alberta were edited for any alphabetic character strings.
The resulting vocabulary contained many short words, no
numerics, minimal spelling errors and imbedded special
characters.





3. Canadian Dictionary: One-quarter of the
Dictionary of Canadian English [2] was keypunched to
obtain a representative sample of a vocabulary which
contained a large number of closely associated words.
This close association occurred when two words had a
difference in alphabetical order of one or two characters.
To provide this, the words in the right hand column of the
odd numbered pages were punched (the words - not the
definitions). This was the only vocabulary to contain
imbedded spaces.


2.6.2 Input Limitations

1. Words or phrases of any length were accepted
but truncated to 19 characters. Those of less
than 19 characters had spaces added which created
a field that contained a left justified word.
This length restriction was arbitrary and could
be modified.

2. Imbedded spaces and special characters were 
permitted. The only special characters employed 
were the hyphen (-) and the apostrophe ('). 


3. Numerics were permitted. 
2.6.3 Statistical Nature of Vocabularies Used as
the Data Base


Table 2.6.3.1


NUMBER OF DIFFERENT WORDS AND AVERAGE LENGTH


                                              Average
Vocabulary             No. Different Words    Length
Chemical Titles              63,316             8.2
MARC Tapes                    6,354             6.7
Canadian Dictionary          10,804             8.4


Comment:

1. Canadian Dictionary contained many phrases.

2. Chemical Titles was very technical.





Length (633 
Now 


XO COT ONT 4] G0) DD 


COMPARATIVE RANKING BY 


Table 2.6.3.2 


DISTRIBUTION OF WORD LENGTH 


(with percentage figures) 


Chemical 


Titles 


ye 


OOM OH WO VUTIDNUI Pp FOr O 


16 words ) 


cS 


MARC 
Tapes 
(6354 words) 

No. fo 
0 0 
Oa 44 
ae 4.9 
668 A085 
878 i370 
1045 voe4 
954 15%0 
IAS Poe 3 
554 San 7 
eye! 5EY 
241 3.8 
Lee 2a 4 
165) Ig 
sul 5 
20 05 


BQ 


se 


Canadian 
Dictionary. 
(10804 words) 
No. % 
0 
59 
228 2 
651 6 
904 8 
Sey 10 
1270 atk 
1436 13 
ae 12 
1196 ii 
866 8. 
634 5% 
Paka 3% 
249 en 
391 an 
Table 2.6.3.3


COMPARATIVE RANKING BY
FREQUENCY OF OCCURRENCE OF INITIAL LETTER
(with percentage figures)

[Table body not recoverable. Columns: Chemical Titles
(63316 words), MARC Tapes (6354 words), Canadian
Dictionary (10804 words), each with rank and percentage.]


Table 2.6.3.4


COMPARATIVE RANKING OF 
THIRTY MOST FREQUENT OCCURRING BIGRAMS 
(with frequency figures) 


Chemical MARC Canadian 
Titles Tapes Dictionary 
Bigram Frequency Bigram Frequency Bigram Frequency 
ON 8462 ER 195 ER 7s 
sae 8416 IN TAO IN 23 40 
ER 8014 ON 695 AN 1276 
NE yloyiat AN favaliyé ON cabal 
EN 7863 EN Die TE DTG 1 
AN 7428 ES B35 sent 1141 
AT 7360 RE S21 AT igs gS | 
ne hele Hal 508 RE 1066 
TE eld AR 496 EN 1008 
RO lay i AL 420 AL 1000 
ES 6442 LE 418 LE 981 
RI 6278 Rr 4i4 oy 958 
AL 6242 RA 399 AR 935 
RA 5634 TE 595 RA 858 
iron 5468 AT 387 NT 832 
AR 4799 oT 315 is S17 
IO 4749 NT etal OR 815 
LA 4666 NG 359 RI 815 
OR 45715 CO 553 IC 736 
NI 4487 LC 320 final 694 
iy 4438 IO 318 eT 692 
RE 1355 Is 316 RO 690 
LE 4298 MA 306 LA 645 
TR H2u3 EL 288 DE 644 
NO POL, NE 288 NE 641 
NT 4014 Li 285 (18) 636 
DI 4007 LA 283 ES 628 
Sub 3875 NS 281 MA 598 
[jel 3842 CH PCr IO 584 
eK 3835 SE 263 NG 581 


These frequencies are independent of position except 
that the first and last letter can only appear in one 


bigram - all others appear twice. 
Table 2.6.3.5


COMPARATIVE RANKING BY
FREQUENCY OF OCCURRENCE OF LETTERS
(with percentage figures)

[Table body not recoverable. Surviving column headings:
MARC Tapes (42474 letters) and a column of 89915 letters,
each with letter and percentage.]


CHAPTER 3


RESULTS AND OBSERVATIONS 


3.1 Results 

The results of the work for this thesis are best
measured by the number of duplicates produced by the
variations of the abbreviation techniques. Table 3.1.1
provides the count of the number of duplicates for these
variations.


Table 3.1.1


NUMBER OF DUPLICATES FOR ALL VOCABULARIES 
AND ALL VARIATIONS OF TECHNIQUES 


Chemical MARC Canadian 
Titles Tapes Dictionary 
(63,316 words) (6354 Peron icgol words) 
Selective A L C C&L ALC C&L A Lc CéL 
dropout 
of every T2001 461, 2439287 ce Mey (0) POe2 7a 
second 
letter 


Bigram SV Oomeangoeley “4 "6776.5 0 Oem Ome 
selection 


Letter/ 

position Taste (om eh? Welebem WT! Wey isis {epi Ty™1972 3 

selection 

Legend: A - technique alone; L - technique with length;
C - technique with check digit; C&L - technique 


with check digit and length. 
The words which are duplicates, with the exception of
those for the techniques alone and techniques plus length
for the Chemical Titles vocabulary, are listed in Appendix
A. The matrices which contain the frequencies for the
construction of indices for the bigram and letter/position
techniques appear in Appendices B and C respectively.
Additional results are more relevant in the following
section with the analysis of the related observation. These
include the number of duplicates for the comparison of the
two ordering methods and the mean and standard deviations
for the hash methods.


3.2 Observations 

3.2.1 Vocabularies and Duplications 
Observation 

Duplication can only be economically avoided in small 
vocabularies. 
Analysis 

Table 3.1.1 illustrates that only the MARC tape vocabulary
(the smallest - 6354 words) produced codes which are
free of duplication. As pointed out in Chapter 2, this
was expected and should be considered during the design
stage of any system using words of the English language as
a data base.

3.2.2 Effect of Imbedded Spaces and Special Characters 
Observation 
In the MARC tapes and Canadian Dictionary vocabularies
there are duplicates that are valid synonyms. This is
because they contain imbedded spaces and special characters.
Analysis
When spaces and special characters are deleted from
the original word, they have no effect on the check digit
because they are coded as 0. If they had been coded as 27,
these duplicates would have been eliminated. The use of
the length code prevents many such duplications. The
exception is COUNTER REFORMATION, where the length exceeds
fifteen characters.
Some examples of the duplicates caused by the inclusion
of imbedded spaces and special characters are:
1. From the Canadian Dictionary, selective dropout
with check digit -
WET NURSE     WETNURSE
2. From the MARC tapes, bigram selection with check
digit -
PILGRIM'S     PILGRIMS
The following table represents the number of duplicates
remaining if those caused by imbedded spaces or
special characters, as illustrated above, are permitted.
The original phrases are considered equal, therefore a
proper code was generated.


Table 3.2.2.1


REDUCTION IN NUMBER OF DUPLICATES 
BY CODING IMBEDDED SPECIAL CHARACTERS AS 27 


MARC Tapes Canadian Dictionary 

A L C C&L A L C C&L 
Selective 1 0 0 0 . 0 mm 0 
dropout y 6 0 0 14 on 0 1 
Bigram 10 0 2 0 0 2 2 1 
selection 57 6 3 0 io? 16 2 au 
eee 0 0 0 0 1 1 0 0 
position 
selection 8 1 0 0 13 18 2 3 


Legend: A - technique alone; L - technique with length;
C - technique with check digit; C&L - technique
with check digit and length.


Note: 1. Superscript indicates number by which original 
count was reduced. 
2. Chemical Titles Vocabulary is not included as it 
contained no imbedded spaces or special 


characters. 


3.2.3 Effect of Modifications
Observation

Modification of the techniques makes a definite
improvement in the generation of unique abbreviation codes.

Analysis

Examination of Table 3.1.1 shows numerous duplications
for the techniques alone and the marked improvement
by appending extra information from the original word.
For example, the number of duplicates for the selective
dropout technique and the MARC tape vocabulary reduces
from 8 to 6 when the original length is added to the code
generated from the technique alone.


3.2.4 Most Effective Single Modification 
Observation 

When making a single variation to a technique, the 
inclusion of a check digit is superior to the inclusion of 
the length. 
Analysis 

From Table 3.1.1 it can be seen that in every case
the number of duplicates is less for adding a check digit
than for adding the length of the original word. In the
MARC tapes selective dropout technique, the number of
duplicates with a check digit is 0 and with length code is
6.


3.2.5 Effect of Inclusion of Both Length and Check
Digit
Observation
In most cases the inclusion of the length and check
digit decreases or does not affect the number of duplicates
compared to the check digit alone.
Analysis

Table 3.1.1 shows that in six out of the nine
possibilities, inclusion of the length and check digit is
an improvement or the results are unaffected. The exceptions
occur when words are similarly spelled by the same
letters in different positions. The addition of length
causes a character to be dropped from each word. The two
letters dropped are different but their addition results
in the check digits having the same value. To illustrate:

LAPSTRAKE LAPSTREAK 

Abbreviating by letter/position selection the letter 


ranking becomes: 


Word index Rank Rank Bae x Word 
L 256 K K 256 L 
A is Ou L L 1c A 
P 506 P iy 506 P 
S 708 Ss S 708 S 
i 835 ay 2 835 T 
R Li2e R R edad. R 
A Way Te A E eee E 
K 106 A A L270 A 
E (aad oa E A 206 K 


Using 7 characters for abbreviating with check digit:

               Letters for
Word           Check Digit        Abbreviation
LAPSTRAKE      E+A = 6 = F        KLPSTRA
LAPSTREAK      A+A = 2 = B        KLPSTRE


Using 6 characters for abbreviating with check digit and
length:

                         Letters for
Word          Length     Check Digit         Abbreviation
LAPSTRAKE       9        E+A+A = 7 = G       KLPSTR
LAPSTREAK       9        A+A+E = 7 = G       KLPSTR


3.2.6 Type of Technique and Effect of Length and 
Check Digit 

Observation 

If the technique employs the location of the letters 
selected, the inclusion of length and check digit is 
not an improvement over the including of a check digit. 
It is, however, a definite improvement when the location is
not considered.
Analysis 

The combination of length and check digit greatly 
increases the discrimination of codes for the bigram tech- 
nique which only considers the letter and its neighboring 
letters. The selective dropout is improved on smaller 
vocabularies. In the Chemical Titles vocabulary the 
exclusion of a letter to permit the inclusion of length 
doubles the number of duplicates. Examination of Table 
3.1.1, for selective dropout, Chemical Titles shows the 
increase from 43 to 87 when the length code is added to the 
technique with the check digit. The letter/position tech- 
nique is adversely affected in both the Canadian Dictionary 


and the Chemical Titles vocabularies although not as
significantly as the selective dropout technique.


3.2.7 Effect of Ordering Selected Letters
Observation

A technique which relies on indices of letters in
a word can be improved through ordering the letters by
selection rather than dropping out of letters.
Analysis 

As explained in 2.3.4, the ordering of selected
letters was tested for techniques with indices (the bigram
and letter/position techniques). Table 3.2.7.1 indicates
the difference this ordering has on the number of duplicates.
That the selection technique proves superior is an
unexpected result.


Table 3.2.7.1


COMPARISON OF NUMBER OF DUPLICATES
FOR SELECTION AND DROPOUT ORDERING OF LETTERS


Chemical MARC Canadian 
Titles Tapes Dietlonanry 
A L C C&L ee Cee Lea iy CCL 


Bigram 
Selection 2765 798 127. 41 GVamOu 5 (O00 Ol) 16. 452 
(sorted) 


Bigram 

Dropout HO7 Gwe 3.216) 68 od Omeo ele O08 39 21 7 

(un- 

sorted) 

Letter/ 

Position 1186. 976 144 162 ae Gay On eas 

Selection 

(sorted) 

Letter/ 

Position 

Dropout BOO0m LOS Tes]. L52 Oe gee OLO PES, Ae) ay 

(un- 

sorted) 

Legend: A - technique alone; L - technique with length;
C - technique with check digit; C&L - technique
with check digit and length.


The main difference for the letter/position technique
results from the employment of the position relative to the
right end of the word. Although this does not affect the
letters selected for elimination, it does affect the size
of the indices. Hence the ranking arrangement of the
letters is different.

Consider the words:
ARGONAUTS and ARGONAUT
from the MARC tape vocabulary. When the letter/position
technique alone is employed the letters have the following
indices:

A eves Dawe S45 
R - 404 R - 458 
G - 98 Gas =e nla 2 
OF =, 248 ORNs en 
Nee 359 NO = 355 
A = 348 eo 
Ue = 116 UF eas 5 
T = 354 SA Seka Mal 
S - 724 


When the dropout ordering is used the S is deleted 
from ARGONAUTS and the result is the duplication of 
ARGONAUT. 

When the selection ordering is used the shift in 
position from the right hand end causes a difference in the 


indices sufficient to alter the abbreviation code. 


Word Ranked letters 
ARGONAUTS GUOAATNRS 
ARGONAUT GUOTNAAR 
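The two ways of ordering the chosen letters can be compared with a short Python sketch. The function names are hypothetical. The indices for ARGONAUTS are those listed above, except that the index of the first A is not legible and is replaced by a placeholder (340) chosen only so that the ranking agrees with the ranked letters shown.

```python
def ranked_positions(indices):
    """Letter positions ordered by index, lowest first (ties keep word order)."""
    return sorted(range(len(indices)), key=lambda i: indices[i])


def selection_order(word, indices, count):
    """'Selection of letters': chosen letters appear in rank order."""
    return "".join(word[i] for i in ranked_positions(indices)[:count])


def dropout_order(word, indices, count):
    """'Dropping out of letters': chosen letters keep their original order."""
    chosen = sorted(ranked_positions(indices)[:count])
    return "".join(word[i] for i in chosen)


# A (placeholder), R, G, O, N, A, U, T, S
argonauts = [340, 404, 98, 248, 359, 348, 116, 354, 724]

print(dropout_order("ARGONAUTS", argonauts, 8))    # ARGONAUT
print(selection_order("ARGONAUTS", argonauts, 8))  # GUOAATNR
```

With dropout ordering the result is the word ARGONAUT itself, which is how the duplication arises; with selection ordering the eight letters come out as GUOAATNR, which differs from the ranked letters GUOTNAAR of ARGONAUT.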


The differences are very marked in the bigram technique
because of the effect of the letter dropped on its
neighbors. It is often sufficient to alter their
position significantly when ranking occurs. Consider the
words:

THEOSOPHICAL and THEOSOPHY
from the Canadian Dictionary vocabulary. When the bigram
dropout technique with check digit is used these words are
duplicates.

“ 1007 * ai 1007 * 
H 891 H 891 

E 583 E 503 

O 343 0 343 

S 466 Ss 466 

O 42 0 44a 

P AYT P 44d 

H 570 H 321 

il HOG Ou Y 3 Lie * 
6 1301 * 

A 1505—% 

L DEVAS Oe a 


* eliminated letters
The letters remaining are the same in both cases -
H E O S O P H. It is unfortunate that the check digit was
insufficient to prevent the duplication.

T+I+C+A+L                    T+Y
20+9+3+1+12 = 45             20+25 = 45

In both cases 45 - 26 = 19, the check digit.

The selection technique creates the abbreviations:

Word             Check Digit     Abbreviation
THEOSOPHICAL         19          OPOSHEH
THEOSOPHY            19          HOPOSEH

The different codes are generated because of the change in
the index for the letter H which is followed by an I in
THEOSOPHICAL but by a Y in THEOSOPHY. The resulting change
in its index from 570 to 321 is sufficient to alter its
position from fifth to first.


3.2.8 Effect of a Large Vocabulary on the Letter/ 
Position Technique 

Observation 

The letter/position technique performs well for the 
small vocabularies (less than 11,000 words) but it appears that 
in a large vocabulary the effect of the position is decreased 
because of the increased letter frequencies. 
Analysis 

For smaller vocabularies, this technique produces 
the minimal number of duplicates with the exception of the 
addition of length and check digit for the Canadian 
Dictionary vocabulary. An examination of the duplicates 
for all vocabularies and variations of the letter/position 
technique indicates the basic problem. The technique is 
unable to discriminate between words that are simple variations
of spellings or misspelled words. The best example
of this is the three duplications in the Canadian Dictionary
when the technique is used along with the length and check
digit.

SEPULCHRE        SEPULCHER
LAPSTRAKE        LAPSTREAK
ACCOUTERMENTS    ACCOUTREMENTS

These three duplicates also typify many of the duplicates
in the Chemical Titles vocabulary. The difference in the
original words is merely a juggling of the characters.
This implies that as the frequency of occurrence of
letters rises the difference between indices from these
frequencies becomes very significant. This difference reduces
the effect of the position on the analysis of letter
significance and tends toward a dependency upon letter
frequency. This is noticeable even with a 4500 word
change in vocabulary. The words SEPULCHRE and SEPULCHER
would still have the same length and check digit if coded
using the MARC tape matrices; however, the letter indices
would prevent duplication.
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Word Index Rank Rank Index Word 
S 288 H H 288 Ss} 
E 598 P is 598 E 
P L358 C C T53 lf 
U 220 L #** RR 220 U 
L 133 U L 183 L 
C V6] S U 161 C 
H 76 R S 76 H 
R 302 E E 630 E 
E 448 E E 174 R 


3.2.9 'Permissible' Duplicates 
Observation 

A duplicate has been defined as two words which are 
spelled differently but generate the same code. The 
effectiveness of the techniques is greatly improved if this 
definition is relaxed to permit obvious misspellings and 
words of the same connotation. 
Analysis 

In an uncontrolled environment interpretation of a
duplicate is very critical and requires human editing.
Table 3.2.9.1 contains the results of such an edit of the
duplicates generated by each technique and variation for
all vocabularies. This table indicates the tremendous
difference a relaxation of the definition of a duplicate


has on the number of duplicates.

NUMBER OF 'PERMISSIBLE' DUPLICATES


            Chemical              MARC                Canadian
            Titles                Tapes               Dictionary
            A*   L*   C   C&L     A   L   C   C&L     A   L   C   C&L
Selective 
Dropout 
Every 198 1350 30 64 Greste OPO 13: 2 24S ONE } 
Second y206 «1461 43 87 6266. 0uln 0 OP Bek eae 
Letter 
Bigram 840 OS0smel od B28 Ceeele OF eh 4 
Selection ayes 7a 6 eee 217k} 67 6 5 O 102) Gus 2 
Letter/ 
Position 220 420 20 18 one Tt 2G 4 Seas 0) 
Selection nee 976 144 162 a ey Gos mee 


Legend: A - technique alone; L - technique with length;
        C - technique with check digit; C&L - technique
        with check digit and length.

Notes: 1. * Figures for these columns are approximations based
          on samples of 250 words.
       2. The small numbers indicate the number of duplicates
          when definition of duplicates is rigid.
Editing was done by examination of the duplicates
under the following guidelines:

For MARC tapes and Canadian Dictionary

Consider as 'permissible duplicates' if:

1. It is an obvious spelling error or variation.

2. One word is a plural of the other.

3. The connotations of the words are similar to
the point that their use is a function of
grammar.

For Chemical Titles

Consider as 'permissible duplicates' if:

1. It is an obvious spelling error or variation.

2. The following minimal rule for stemming the
right end of a word applies: drop E's, S's
and ES's and compare remainder for duplication.

(Because the author lacked a chemical background to ascertain
'permissible' duplicates based on connotation this simple
rule was used).
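
The minimal stemming rule used for Chemical Titles can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and it assumes the rule means removing a single trailing ES, S or E (longest match first) before comparing the remainders. The thesis does not state the order in which the endings are tried.

```python
def minimal_stem(word: str) -> str:
    """Drop one trailing ES, S or E (longest first); an assumed reading of the rule."""
    w = word.upper()
    for suffix in ("ES", "S", "E"):
        # Never strip the whole word away.
        if w.endswith(suffix) and len(w) > len(suffix):
            return w[: -len(suffix)]
    return w


def is_permissible_duplicate(a: str, b: str) -> bool:
    """Two words count as 'permissible' duplicates if their remainders match."""
    return minimal_stem(a) == minimal_stem(b)


if __name__ == "__main__":
    for pair in [("OXIDE", "OXIDES"), ("ACID", "ACIDS"), ("ACID", "BASE")]:
        print(pair, is_permissible_duplicate(*pair))
```

The rule is deliberately crude: it only decides whether two colliding words may be treated as one entry, and makes no claim about their meaning.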


3.2.10 Reduction to Word Stems
Observation

Word stemming and abbreviation techniques complement
each other. A stemming method can eliminate from a vocabulary
the words which are the most difficult to abbreviate.
An unmodified abbreviation technique can assist a stemming
method in determining the number and variety of suffixes
in the stem list.
Analysis 

As pointed out in Section 3.2.9, a difference must
be made in using a word for information content or as a
single entity. The edit of the Chemical Titles vocabulary,
which employs a simple stemming technique, points out
that when a word is used for its information content, a
stemming method serves to reduce the number of duplicates
by eliminating the words most difficult to abbreviate.

The edit of the Chemical Titles vocabulary
indicates the significance of stemming to abbreviation
techniques but the edit of the other vocabularies indicates
how abbreviation techniques can assist a stemming method.
The edits, consequently reducing the number of duplicates,
could have been performed by a stemming algorithm. Hence
a stem-list could be constructed by a review of duplicates.
Examination of MARC tapes and Canadian Dictionary vocabularies
provided the following possible suffix list:


ING
IC
ENT
ION
ER
ED
IONS
[illegible] LY
AL
OLOGY
OLOGICAL
ICS
GRAPH
GRAPHIC
GRAPHY
GRAPHICAL
IS
ENCE
ENCY
ANCE
ANCY
ABLE
ABLY
IBLY
IBLE
ICAL
AR
IA
IAN
INE
ANE
LY
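
One way a review of duplicates might yield candidate suffixes is to split each pair of colliding words at their longest common prefix and tally the leftover endings. The sketch below is an assumption about how such a review could be mechanised, not the procedure used in the thesis. The function names, the `min_stem` threshold and the sample pairs are hypothetical.

```python
import os
from collections import Counter


def split_on_common_prefix(a: str, b: str):
    """Return (shared prefix, tail of a, tail of b)."""
    prefix = os.path.commonprefix([a, b])
    return prefix, a[len(prefix):], b[len(prefix):]


def suffix_candidates(duplicate_pairs, min_stem: int = 3) -> Counter:
    """Count the word endings left over once the shared stem is removed."""
    counts = Counter()
    for a, b in duplicate_pairs:
        stem, tail_a, tail_b = split_on_common_prefix(a.upper(), b.upper())
        if len(stem) < min_stem:
            continue  # too little in common to suggest a shared stem
        for tail in (tail_a, tail_b):
            if tail:
                counts[tail] += 1
    return counts


if __name__ == "__main__":
    pairs = [("GEOGRAPHIC", "GEOGRAPHICAL"), ("BIOLOGY", "BIOLOGICAL")]
    print(suffix_candidates(pairs).most_common())
```

Endings that recur across many duplicate pairs would be the natural candidates for a stem list. A human edit would still be needed to reject accidental matches.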


3.2.11 Effect of Hashing Methods
Observation 

The vocabularies can be very effectively sectionalized
using a hashing technique. A very even distribution
results when the hash is done by summing the five bytes of
the code and using the low order eight bits of the result
as a key.
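
The byte summation hash described in this observation can be sketched as follows. The function names and the sample codes are hypothetical. The only facts taken from the text are that the five bytes of the code are summed and the low order eight bits of the sum are used as the key.

```python
def byte_sum_hash(code: bytes, bits: int = 8) -> int:
    """Sum the bytes of a fixed-length code and keep the low-order `bits` bits."""
    return sum(code) & ((1 << bits) - 1)


def sectionalize(codes, bits: int = 8):
    """Group codes into blocks keyed by their hash value."""
    blocks = {}
    for code in codes:
        blocks.setdefault(byte_sum_hash(code, bits), []).append(code)
    return blocks


if __name__ == "__main__":
    sample = [bytes([1, 2, 3, 4, 5]), bytes([5, 4, 3, 2, 1]), bytes([255] * 5)]
    for key, members in sorted(sectionalize(sample).items()):
        print(key, members)
```

Passing `bits=11` gives the eleven bit variant of the same summation that is compared in the analysis.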

Analysis 

Tables 3.2.11.2 and 3.2.11.3 provide the mean and
standard deviation figures for the MARC tapes and the
Canadian Dictionary vocabularies for all techniques and
variations. Table 3.2.11.1 provides similar data for the
Chemical Titles vocabulary but excludes figures for the
techniques alone and techniques plus length as the number
of duplicates generated in these instances exceeds a usable
figure.

In all three vocabularies and for all techniques
the best method was the byte summation using the eight
low order bits of the result. It not only had the lowest
standard deviation but this was maintained at a level of
[illegible] as the size of the vocabulary increased. The
other methods had increased standard deviations. When the
standard deviation is considered relative to the mean for
that hashing method, the remaining methods can be ranked as
follows:


2. 5 bit summation

3. First byte

4. Byte summation, use 11 bits

5. First 14 bits

[Tables of mean and standard deviation for each hashing method,
by technique and variation: contents not recoverable.]

Scanning down the tables one can see that the
hashing techniques are almost independent of
the abbreviation technique or variation. Thus, ignoring the
coding technique, the effect of the hashing methods on the
Canadian Dictionary is summarized as follows:


                                    Standard Deviation
Hash Method                 Mean    (approximate)         Rank
First 14 bits               66      [illegible]           5
Byte summation
  using 11 bits             8.5     [illegible]           4
Five bit summation          46.5    [illegible]           2
Byte summation
  using at least 8 bits     42.2    [illegible]           1
First byte                  42.2    [illegible]           3


The other vocabularies have similar results. 

Differences do occur when the length is employed
in an abbreviation technique. For example, the standard
deviation for the first fourteen bit hashing method, when
used on selective dropout generated abbreviations, goes from
[illegible] when the length is not used, to [illegible] when it is. This
is because the location of the four bits that contain the
length is at the extreme left of the code. It greatly
increases the discriminatory powers of the first byte and
first fourteen bits methods. The continued presence of
high order bits smooths the distribution of the keys.

The five bit summation and byte summation using
eleven bits are negatively affected by the presence of this
code.

Both these methods are elementary hashing techniques
and have the fault of all hashes by addition.
During a hash by addition one must allow for the maximum
possible result (5 x 255 + 1 or 7 x 31 + 1 in this
instance); however, addition seldom results in hash values
in the low or high ranges. If the values
that are hashed (0 to 255 in a byte, or 0 to 31 in five
bits) are evenly distributed this will result in a concentration
toward the middle of the hash value range.
The addition of a length code to the abbreviation to be
hashed has the effect of moving away from the assumed
even distribution of values. The values to be hashed
are biased because of the many words with lengths of 6 to
9 letters.
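
The concentration toward the middle can be checked exactly under the idealised assumption that every byte value is equally likely, which real abbreviation codes will not satisfy. The sketch below counts how many of the 256^5 possible five byte codes produce each sum, then folds the sums into the low order eight bits. The function names are hypothetical.

```python
def sum_distribution(n_terms: int, n_values: int):
    """Exact count of each possible sum of n_terms values drawn from 0..n_values-1."""
    dist = [1]  # one way to obtain a sum of zero from zero terms
    for _ in range(n_terms):
        new = [0] * (len(dist) + n_values - 1)
        for s, count in enumerate(dist):
            for v in range(n_values):
                new[s + v] += count
        dist = new
    return dist


def fold(dist, bits: int):
    """Keep only the low-order `bits` bits of every sum."""
    buckets = [0] * (1 << bits)
    for s, count in enumerate(dist):
        buckets[s & (len(buckets) - 1)] += count
    return buckets


if __name__ == "__main__":
    dist = sum_distribution(5, 256)
    print("possible sums:", len(dist))             # 5 x 255 + 1
    print("ways to reach 0:", dist[0])
    print("ways to reach 637:", dist[637])
    print("8-bit buckets equal:", len(set(fold(dist, 8))) == 1)
```

Under this assumption the raw sums pile up around the centre of the range, so a key that keeps eleven bits inherits the unevenness. Keeping only eight bits wraps the range around several times and, for ideally uniform bytes, evens it out exactly. This is offered only as a plausible reading of why the eight bit variant ranked first.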


3.3 Observation Summary


The basic observations from the work for this
thesis are:

1. Duplication can only be economically avoided
in small vocabularies.

2. In the MARC tapes and Canadian Dictionary
vocabularies there are duplicates that are
valid synonyms. This is because they contain
imbedded spaces and special characters.

3. Modification of the techniques makes a
definite improvement in the generation of
unique abbreviation codes.

4. When making a single variation to a technique,
the inclusion of a check digit is superior to
the inclusion of the length code.

5. In most cases the inclusion of the length and
check digit decreases or does not affect the
number of duplicates compared to the check digit
alone.

6. If the technique employs the location or
position of the letters selected, the inclusion
of the length and check digit is not an
improvement over including a check digit. It
is, however, a definite improvement when the
location is not considered.

7. A technique which relies on indices of letters
in a word can be improved through ordering the
letters by selection rather than dropping out
of letters.

8. The letter/position technique performs well
for the small vocabularies (less than 11,000
words) but in a large vocabulary the effect
of the position is decreased because of the
increased letter frequencies.

9. A duplicate has been defined as two words
which are spelled differently but generate the
same code. The effectiveness of the techniques
is greatly improved if this definition
is relaxed to permit obvious misspellings and
words of the same connotation.

10. Word stemming and abbreviation techniques
complement each other. A stemming method can
eliminate from a vocabulary the words which
are the most difficult to abbreviate. An
unmodified abbreviation technique can assist
a stemming method in determining the number
and variety of suffixes in the stem list.

11. The vocabularies can be very effectively
sectionalized using a hashing technique. A
very even distribution results when the hash
is done by summing the five bytes of the code
and using the low order eight bits of the
result as a key.


CHAPTER IV


CONCLUSIONS AND RESEARCH EXTENSIONS 


4.1 Conclusions 

The objectives of this thesis, with the exception of
decoding as discussed in Section 2.1.3, have clearly been
realized. As the observations point out there is an algorithm
to abbreviate English words for use by a computer. The
storage and handling of such a data base can be improved
through coding and sectionalizing by the techniques presented
in this thesis.

Compared to using the words in EBCDIC, the saving
can range from 25% to 41% (five bytes compared to 6.7 or 8.4
bytes per word). The saving ranges from 6% to 25% if the
letters are translated to a six bit representation. If this
is done and the words are maintained on a byte boundary, a
three letter word requires three bytes to contain the 18 bits;
however, a four letter word can be contained in the same
space. Hence the storage needed to contain the MARC tapes
vocabulary becomes 34074 bytes and the Canadian Dictionary
requires 72405 bytes. This is compared to 31770 (6354 x 5)
bytes for the abbreviated MARC tapes codes and 54020 (10804
x 5) bytes for those of Canadian Dictionary. An additional
saving results in continuous text processing as the need for
a word delimiter is eliminated with the fixed length code.
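
The storage figures above can be reproduced with a little arithmetic. The sketch below uses only the word counts and byte totals quoted in the text. The function and variable names are hypothetical.

```python
def six_bit_bytes(n_letters: int) -> int:
    """Bytes needed for a word stored at six bits per letter on a byte boundary."""
    return (6 * n_letters + 7) // 8


def saving(before: float, after: float) -> float:
    """Fraction of storage saved when `before` bytes shrink to `after` bytes."""
    return 1.0 - after / before


CODE_BYTES = 5
WORDS = {"MARC tapes": 6354, "Canadian Dictionary": 10804}
SIX_BIT_TOTAL = {"MARC tapes": 34074, "Canadian Dictionary": 72405}


def report():
    """Return (vocabulary, bytes as five-byte codes, saving versus six-bit storage)."""
    rows = []
    for name, count in WORDS.items():
        coded = count * CODE_BYTES
        rows.append((name, coded, saving(SIX_BIT_TOTAL[name], coded)))
    return rows


if __name__ == "__main__":
    print(six_bit_bytes(3), six_bit_bytes(4))   # both fit in three bytes
    for name, coded, frac in report():
        print(f"{name}: {coded} bytes, saving {frac:.1%}")
    print(f"EBCDIC: {saving(6.7, 5):.1%} to {saving(8.4, 5):.1%}")
```

With the rounded averages of 6.7 and 8.4 bytes per word the EBCDIC saving works out to roughly 25% and 40%, close to the range quoted above.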


The handling of the word is far easier with the
fixed length code, and the sectionalizing as described in
Observation 3.2.11 reduces the data base to a more easily
managed group of data blocks. This is valuable internally
and also for external storage.

As pointed out in Observation 3.2.9, a different
algorithm should be used depending on the use of the
abbreviation code. Regardless of which algorithm is used the
sectionalization is performed in the same manner.


4.1.1 An Algorithm for Words to be Used For 
Information Content 

The letter/position technique should be used and
should be preceded by a stemming algorithm based on a stem
list for the particular vocabulary. This stem list can
be compiled from the list of duplicates created when the
letter/position technique is employed with no modifications.
The abbreviation should be done by either the technique
with check digit or the technique with length and check
digit.


4.1.2 An Algorithm for Abbreviating Words to be 
Used as Unique Entities 
With the definition of duplication rigidly fixed
as two words spelled differently but generating the
same abbreviation code, there are two possible algorithms.
A review of Table 3.1.1 indicates for the largest
vocabulary, Chemical Titles, that either selective dropout
of every second character and a check digit or bigram
selection with length and check digit should be used.
There should be no stemming algorithm but the data can be
sectionalized by the byte summation method using the last
eight bits.


4.1.3 System Constraints

When an algorithm is to be employed certain
conditions are imposed.

1. The vocabulary is initially passed to provide
a check list of duplicates. Very seldom will
a system permit the duplications generated by
the algorithm to remain unresolved. Hence the
list must be created and made internal to the
system so these words can be altered.

2. If the vocabulary is to be permitted to grow,
a system of checks should be built in to edit
for duplication. This could be a simple
system of logging the words used to access the
data base for later checking. A more sophisticated
system would be a random or regular
verification of the actual spelling of the
word accessing the data base and the word in
the data base. To do this properly would
require a statistical analysis of search words
and an improvement of the verification frequency
based on the number of new duplications
encountered. Of course words not found in the
data base can be added but verification of
this data should be provided by human editing.

3. Because speed of handling is always a factor
in processing large volumes of data, the
selective dropout technique should be used
whenever possible. This technique simply scans
the word to determine the length, and passes
over the word as many times as necessary to drop
characters. The letter/position technique
scans the word with two additional instructions
to create a frequency table, then backward
scans the word to add the second frequency to
create the indices, and then ranks the indices.
The bigram selection has the same time loss
in ranking the indices but is able to create
the indices with a single scan of the word with
four additional instructions.
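
The initial pass that builds a check list of duplicates could look like the sketch below. It is not the thesis's selective dropout procedure, whose details are not given in this section. The `selective_dropout` function here is a simplified stand-in that repeatedly drops every second letter until a hypothetical target length of six is reached, and all names are assumptions made for illustration.

```python
from collections import defaultdict


def selective_dropout(word: str, target_len: int = 6) -> str:
    """Simplified stand-in: drop every second letter, pass by pass, until the word fits."""
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    w = word.upper()
    while len(w) > target_len:
        excess = len(w) - target_len
        # Positions 1, 3, 5, ... are the 'every second' letters; drop only as many as needed.
        drop = set(range(1, len(w), 2)[:excess])
        w = "".join(ch for i, ch in enumerate(w) if i not in drop)
    return w


def duplicate_check_list(words, encode=selective_dropout):
    """Map each code shared by two or more distinct words to those words."""
    by_code = defaultdict(set)
    for word in words:
        by_code[encode(word)].add(word.upper())
    return {code: sorted(ws) for code, ws in by_code.items() if len(ws) > 1}


if __name__ == "__main__":
    vocabulary = ["ABSORPTION", "ADSORPTION", "SEPULCHRE"]
    print(duplicate_check_list(vocabulary))
```

Any of the abbreviation techniques could be passed as `encode`. The check list is simply the set of codes that more than one spelling maps to, which is what a system would need to hold internally to resolve them.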


4.2 Research Extension

There are three areas of research which might
improve the algorithms used in this study. Although improved
discrimination of generated codes is of debatable value,
further research, as outlined below, might decrease the
number of duplicates which could lead to a smaller standard
deviation.


4.2.1 Improvement on Letter/Position Technique


As noted earlier in Observation 3.2.8, this
technique seems to degenerate on large vocabularies. This
appears to be caused by the increased distance between
indices due to larger letter frequencies. There are two
possible means of preventing this:

1. Limit the number of words to be used in
establishing the letter/position matrices. This
could be done by many sampling processes.
Care is needed to insure a good representation
of the vocabulary but maintain a reasonable,
small distance between the frequencies of letters
which create the indices.

2. Normalize the values of the matrices with
respect to either the position or the letter.
In truth the frequency of the letter in a certain
position does not contain the full degree
of significance of that letter in that position.
More significance could be gained by asking:
does the appearance of this letter in this
position have any extra significance in this
position rather than any other position? A
measure of this would be to normalize the
frequency of appearance of each letter in each
position by dividing the frequency in a position
by the total appearance of the letter. This
creates a value in each element which can be
interpreted: regardless of the vocabulary
size, when this letter is used it occurs this
percentage of the time in this position
(percentage if normalization is multiplied by
100).
The method of normalizing with respect to
position would answer: does this appearance of
this letter in this position have any extra
significance because it is this letter and not
one of the others? This normalization would be
performed by dividing the frequency by the total
number of letters appearing in that position.
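
The two normalizations proposed above can be written out as a short sketch. The matrix layout (a mapping from letter to a list of counts by position), the function names and the toy counts are assumptions made for illustration. The thesis does not specify how its letter/position matrices are stored.

```python
def normalize_by_letter(freq):
    """Divide each count by the total appearances of that letter (row total)."""
    out = {}
    for letter, row in freq.items():
        total = sum(row)
        out[letter] = [count / total if total else 0.0 for count in row]
    return out


def normalize_by_position(freq):
    """Divide each count by the total letters appearing in that position (column total)."""
    n_positions = len(next(iter(freq.values())))
    totals = [sum(row[p] for row in freq.values()) for p in range(n_positions)]
    return {
        letter: [row[p] / totals[p] if totals[p] else 0.0 for p in range(n_positions)]
        for letter, row in freq.items()
    }


if __name__ == "__main__":
    # Toy counts: two letters, two positions.
    freq = {"A": [30, 10], "B": [10, 50]}
    print(normalize_by_letter(freq))    # share of each letter's uses per position
    print(normalize_by_position(freq))  # share of each position taken by each letter
```

Multiplying either result by 100 gives the percentage reading described in the text. Both results lie between 0 and 1 regardless of vocabulary size, which is the property the proposal relies on.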
It is difficult to speculate what effect either
of these normalizations would have without testing to see
if they would improve the abbreviation technique. Letters
which have a high or low frequency of occurrence would
probably not change very much in their use as indices. The
reduced size between indices could alter the significance
of letters in the middle frequency range and consequently
alter the rank of letters selected for coding an abbreviation.


4.2.2 Improvement of Bigram Technique 
While it would be advantageous to employ a
trigram technique the size of the matrix (26 x 26²) makes
this almost prohibitive. An alternative to this could be
a technique which uses the normal bigram matrix plus a
matrix which contains the frequency of a letter appearing
relative to letters two positions away.
For example:
The normal bigram pairs for the word HOUSE are:
space H, HO, OU, US, SE, and E space
The second matrix would consider the bigram pairs for
HOUSE to be:
space H, space O, HU, OS, UE, S space, and E space
The indices would be generated the same as for the
bigram technique except there would be four factors creating
the indices.
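
The two sets of pairs in the HOUSE example can be generated mechanically. The sketch below assumes the word is padded with spaces at both ends, which matches the pairs listed above. The function name is hypothetical.

```python
def padded_pairs(word: str, gap: int):
    """Pairs of characters `gap` positions apart, with the word padded by spaces."""
    padded = " " * gap + word.upper() + " " * gap
    return [padded[i] + padded[i + gap] for i in range(len(padded) - gap)]


if __name__ == "__main__":
    print(padded_pairs("HOUSE", 1))  # normal bigram pairs
    print(padded_pairs("HOUSE", 2))  # pairs two positions apart
```

Each letter then takes part in two adjacent pairs and two pairs at a distance of two, which would account for the four factors contributing to its index.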


4.2.3 Improving a Technique Which Ranks Indices 
Created From Frequency Counts 

As illustrated in Observations 3.2.5 and 3.2.6
the inclusion of one more character in the abbreviated
code can often eliminate a duplicate. One way to perform
this would be to delete the lowest index and include it
in the check digit. This would permit the inclusion of one
more character in the abbreviation code. This would not
prevent duplication remaining if the two new characters
that were added to the codes were the same and the two
deleted were the same. It would eliminate the duplication
if there were any differences among the characters with one
exception:

1. The two codes are the same except for the
check digit and the first letter.

2. The letters to be taken from the check digit
to the abbreviation are the same.

3. Addition of the letters taken from the
abbreviation to the check digits results in
equal check digits.

It is impossible to speculate on the results this
change would have without detailed knowledge of the
characters to be [illegible] in the codes. This could only be
tested and measured for duplications generated to determine
its effect on abbreviation techniques.